{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 稀疏矩阵之 COO, CSR, CSC\n",
    "\n",
    "本文讨论 scipy 中的稀疏向量，因为前几天在使用 `pandas.get_dummies` 时遇到了一些问题：\n",
    "\n",
    "```python\n",
    "X = pd.get_dummies(dataset, columns=dataset.columns, sparse=True).to_coo()\n",
    "\n",
    "X_train = X[:train.shape[0]]\n",
    "X_test = X[train.shape[0]:]\n",
    "```\n",
    "\n",
    "上面的代码会报错，因为 COO 格式的稀疏矩阵不能进行切片。需要下面这样才行：\n",
    "\n",
    "```python\n",
    "X = X.tocsr()\n",
    "X_train = X[:train.shape[0]]\n",
    "X_test = X[train.shape[0]:]\n",
    "```\n",
    "\n",
    "今天花了点时间，了解了下稀疏矩阵常见的这三种格式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### COO - Coordinate Format\n",
    "\n",
    "COO 格式通过坐标来定义稀疏矩阵，这时最符合直觉的一种格式，只需要指定非零元素的行、列、具体数值即可。下面是个例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  (0, 2)\t1\n",
      "  (1, 1)\t2\n",
      "  (2, 0)\t3\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "from scipy.sparse import coo_matrix\n",
    "\n",
    "row  = np.array([0, 1, 2])\n",
    "col  = np.array([2, 1, 0])\n",
    "data = np.array([1, 2, 3])\n",
    "\n",
    "print(coo_matrix((data, (row, col)), shape=(3, 3)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以使用 `toarray()` 或者 `todense` 转为稠密数组或者矩阵。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 0, 1],\n",
       "       [0, 2, 0],\n",
       "       [3, 0, 0]])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coo_matrix((data, (row, col)), shape=(3, 3)).toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果在同一个位置上指定了多个数，那么在转为稠密矩阵或者转为其他格式的时候，相同位置额值会累加起来："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0,  0, 10],\n",
       "       [ 0,  2,  0],\n",
       "       [ 3,  0,  0]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "row  = np.array([0, 1, 2, 0])\n",
    "col  = np.array([2, 1, 0, 2])\n",
    "data = np.array([1, 2, 3, 9])\n",
    "\n",
    "A = coo_matrix((data, (row, col)), shape=(3, 3))\n",
    "A.toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "COO 格式便于手动构造稀疏矩阵，但是它存在诸多短板，没办法做切片。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "'coo_matrix' object is not subscriptable",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-4-07c86f5718f6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mA\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m: 'coo_matrix' object is not subscriptable"
     ]
    }
   ],
   "source": [
    "A[:1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了支持行和列的切片，需要使用 CSR 和 CSC 格式。COO 格式可以使用 `tocsr` 和 `tocsc` 来转换格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<3x3 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 3 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A.tocsr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CSR - Compressed Sparse Row\n",
    "\n",
    "使用稀疏矩阵其中一个动机就是节省存储空间，CSR 格式在存储的时候是按行压缩存储空间，COO 格式中，每个元素都需要三个值（行+列+值）来描述，在 CSR 和后面提到的 CSC 中能够使用更少的信息来表示稀疏矩阵。\n",
    "\n",
    "可以使用 和 COO 相同的方式来创建 CSR 和 CSC 格式的系数矩阵。行+列+值，就可以了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 0, 2],\n",
       "       [0, 0, 3],\n",
       "       [4, 5, 6]], dtype=int64)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from scipy.sparse import csr_matrix\n",
    "\n",
    "row  = np.array([0, 0, 1, 2, 2, 2])\n",
    "col  = np.array([0, 2, 2, 0, 1, 2])\n",
    "data = np.array([1, 2, 3, 4, 5, 6])\n",
    "csr_matrix((data, (row, col)), shape=(3, 3)).toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在 CSR 中还有另一种创建方式，其参数形式如下：\n",
    "\n",
    "```python\n",
    "csr_matrix((data, indices, indptr), [shape=(M, N)])\n",
    "```\n",
    "\n",
    "下面举例说明："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 2, 0],\n",
       "       [0, 0, 3],\n",
       "       [4, 5, 6]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "indptr = np.array([0, 2, 3, 6])\n",
    "indices = np.array([0, 1, 2, 0, 1, 2])\n",
    "data    = np.array([1, 2, 3, 4, 5, 6])\n",
    "A = csr_matrix((data, indices, indptr), shape=(3, 3))\n",
    "A.toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这是一种新的创建 CSR 矩阵的形式，在 CSR 格式中，数据是按行存储在一个列表中的，下面举例说明各个参数的含义：\n",
    "\n",
    "`indptr` 中包含 4 个元素，这 4 个元素可以确定 3 个区间，`0-2, 2-3, 3-6`。以 `0-2` 为例：\n",
    "\n",
    "```python\n",
    "indices[0:2] -> [0, 1]\n",
    "```\n",
    "\n",
    "这说明，在第 1 行中 `0, 1` 两列有值。\n",
    "\n",
    "```python\n",
    "data[0:2] -> [1, 2]\n",
    "```\n",
    "\n",
    "这说明第一行中的值是 `1, 2`。\n",
    "\n",
    "有了 indices 指示的列信息，以及 data 中指示的数值信息，稀疏矩阵就能确定了。\n",
    "\n",
    "同理，`2-3` 区间就用于确定第 2 行的列和数值信息：\n",
    "\n",
    "```python\n",
    "indices[2:3] -> [2] # 列号\n",
    "data[2:3] -> [3] # 值\n",
    "```\n",
    "\n",
    "说明：`A[1, [2]] = [3]`\n",
    "\n",
    "CSR 可以支持切片，在做行切片的时候速度很快，这得益于其存储格式。在做列切片的时候就速度就较慢了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "311 µs ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n",
      "428 µs ± 71.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "row = np.random.randint(0, 2000, 50000)\n",
    "col = np.random.randint(0, 2000, 50000)\n",
    "data = np.ones(50000)\n",
    "A = csr_matrix((data, (row, col)))\n",
    "%timeit _ = A[:1000,:]\n",
    "%timeit _ = A[:,:5000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CSC - CCompressed Sparse Column\n",
    "\n",
    "CSC 格式和 CSR 格式类似，只是在压缩的时候是按列压缩。CSC 格式在做列切片的时候会比较快。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 1, 4],\n",
       "       [0, 0, 5],\n",
       "       [2, 3, 6]], dtype=int64)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from scipy.sparse import csc_matrix\n",
    "\n",
    "row  = np.array([0, 2, 2, 0, 1, 2])\n",
    "col  = np.array([1, 0, 1, 2, 2, 2])\n",
    "data = np.array([1, 2, 3, 4, 5, 6])\n",
    "csc_matrix((data, (row, col)), shape=(3, 3)).toarray()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1, 0, 4],\n",
       "       [0, 0, 5],\n",
       "       [2, 3, 6]])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "indptr = np.array([0, 2, 3, 6])\n",
    "indices = np.array([0, 2, 2, 0, 1, 2])\n",
    "data = np.array([1, 2, 3, 4, 5, 6])\n",
    "csc_matrix((data, indices, indptr), shape=(3, 3)).toarray()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
